Xnews Regex
pcre - Perl-compatible regular expressions.
Luu Tran's notes to Xnews users
First, be sure to read the section on regular expressions
in the readme.txt. Beginners should look at the regular
expression tutorial here.
You can interactively test regex matching by selecting
View | Test Regex.
Xnews uses Philip Hazel's PCRE package. This document is
a condensed version of the pcre man page. I have removed
the sections on programming interface and differences
between Perl regex and PCRE.
At this point, the only option switch I allow is case
insensitivity. Therefore, when you see mention of
PCRE_xxxxxxxx other than PCRE_CASELESS, you should ignore
the discussion. In addition, since you will be matching
a regex pattern against a single line string (e.g., From
header), you should also ignore the differentiation
between single line and multi-line matching.
DESCRIPTION
The PCRE library is a set of functions that implement reg-
ular expression pattern matching using the same syntax and
semantics as Perl 5, with just a few differences (see
below). The current implementation corresponds to Perl
5.005.
The syntax and semantics of the regular expressions sup-
ported by PCRE are described below. Regular expressions
are also described in the Perl documentation and in a num-
ber of other books, some of which have copious examples.
Jeffrey Friedl's "Mastering Regular Expressions", pub-
lished by O'Reilly (ISBN 1-56592-257-3), covers them in
great detail. The description here is intended as refer-
ence documentation.
A regular expression is a pattern that is matched against
a subject string from left to right. Most characters stand
for themselves in a pattern, and match the corresponding
characters in the subject. As a trivial example, the pat-
tern
The quick brown fox
matches a portion of a subject string that is identical to
itself. The power of regular expressions comes from the
ability to include alternatives and repetitions in the
pattern. These are encoded in the pattern by the use of
meta-characters, which do not stand for themselves but
instead are interpreted in some special way.
There are two different sets of meta-characters: those
that are recognized anywhere in the pattern except within
square brackets, and those that are recognized in square
brackets. Outside square brackets, the meta-characters are
as follows:
\ general escape character with several uses
^ assert start of subject (or line, in multiline
mode)
$ assert end of subject (or line, in multiline
mode)
. match any character except newline (by default)
[ start character class definition
| start of alternative branch
( start subpattern
) end subpattern
? extends the meaning of (
also 0 or 1 quantifier
also quantifier minimizer
* 0 or more quantifier
+ 1 or more quantifier
{ start min/max quantifier
Part of a pattern that is in square brackets is called a
"character class". In a character class the only meta-
characters are:
\ general escape character
^ negate the class, but only if the first character
- indicates character range
] terminates the character class
The following sections describe the use of each of the
meta-characters.
BACKSLASH
The backslash character has several uses. Firstly, if it
is followed by a non-alphameric character, it takes away
any special meaning that character may have. This use of
backslash as an escape character applies both inside and
outside character classes.
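The escaping rule can be tried out with a small sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption made for illustration only (Xnews itself is not scripted this way); the two engines agree on backslash escaping of non-alphameric characters.

```python
import re

# An escaped metacharacter stands for itself.
found_star = re.search(r"\*", "2 * 3") is not None          # literal asterisk
found_backslash = re.search(r"\\", "C:\\temp") is not None  # literal backslash

# An unescaped dot matches any character; an escaped dot matches only ".".
loose = re.search(r"a.c", "abc") is not None
strict = re.search(r"a\.c", "abc") is not None

print(found_star, found_backslash, loose, strict)
```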
For example, if you want to match a "*" character, you
write "\*" in the pattern. This applies whether or not the
following character would otherwise be interpreted as a
meta-character, so it is always safe to precede a non-
alphameric with "\" to specify that it stands for itself.
In particular, if you want to match a backslash, you write
"\\".
If a pattern is compiled with the PCRE_EXTENDED option,
whitespace in the pattern (other than in a character
class) and characters between a "#" outside a character
class and the next newline character are ignored. An
escaping backslash can be used to include a whitespace or
"#" character as part of the pattern.
A second use of backslash provides a way of encoding non-
printing characters in patterns in a visible manner. There
is no restriction on the appearance of non-printing char-
acters, apart from the binary zero that terminates a pat-
tern, but when a pattern is being prepared by text edit-
ing, it is usually easier to use one of the following
escape sequences than the binary character it represents:
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh character with hex code hh
\ddd character with octal code ddd, or backreference
The precise effect of "\cx" is as follows: if "x" is a
lower case letter, it is converted to upper case. Then bit
6 of the character (hex 40) is inverted. Thus "\cz"
becomes hex 1A, but "\c{" becomes hex 3B, while "\c;"
becomes hex 7B.
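The "\cx" rule is simple arithmetic, sketched here in Python. The function name control_escape is hypothetical and exists only for this illustration; it is not part of PCRE.

```python
def control_escape(x: str) -> int:
    """Return the character code that backslash-c followed by x stands for."""
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 (hex 40) of the character code is inverted.
    return ord(x) ^ 0x40


for ch in ("z", "{", ";"):
    print(ch, hex(control_escape(ch)))
```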
After "\x", up to two hexadecimal digits are read (letters
can be in upper or lower case).
After "\0" up to two further octal digits are read. In
both cases, if there are fewer than two digits, just those
that are present are used. Thus the sequence "\0\x\07"
specifies two binary zeros followed by a BEL character.
Make sure you supply two digits after the initial zero if
the character that follows is itself an octal digit.
The handling of a backslash followed by a digit other than
0 is complicated. Outside a character class, PCRE reads
it and any following digits as a decimal number. If the
number is less than 10, or if there have been at least
that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back refer-
ence. A description of how this works is given later, fol-
lowing the discussion of parenthesized subpatterns.
Inside a character class, or if the decimal number is
greater than 9 and there have not been that many capturing
subpatterns, PCRE re-reads up to three octal digits fol-
lowing the backslash, and generates a single byte from the
least significant 8 bits of the value. Any subsequent dig-
its stand for themselves. For example:
\040 is another way of writing a space
\40 is the same, provided there are fewer than 40
previous capturing subpatterns
\7 is always a back reference
\11 might be a back reference, or another way of
writing a tab
\011 is always a tab
\0113 is a tab followed by the character "3"
\113 is the character with octal code 113 (since there
can be no more than 99 back references)
\377 is a byte consisting entirely of 1 bits
\81 is either a back reference, or a binary zero
followed by the two characters "8" and "1"
Note that octal values of 100 or greater must not be
introduced by a leading zero, because no more than three
octal digits are ever read.
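A few of the unambiguous cases can be checked with the sketch below, which uses Python's re module as a stand-in for PCRE (an assumption for illustration). Only forms with a leading zero are shown, because Python's handling of other digit sequences is not guaranteed to follow the PCRE rules described above.

```python
import re

# \040 is octal 40, that is, a space.
space = re.fullmatch(r"\040", " ") is not None

# \011 is octal 11, that is, a tab.
tab = re.fullmatch(r"\011", "\t") is not None

# After a leading zero at most two more octal digits are read,
# so \0113 is a tab followed by the character "3".
tab_then_three = re.fullmatch(r"\0113", "\t3") is not None

print(space, tab, tab_then_three)
```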
All the sequences that define a single byte value can be
used both inside and outside character classes. In addi-
tion, inside a character class, the sequence "\b" is
interpreted as the backspace character (hex 08). Outside a
character class it has a different meaning (see below).
The third use of backslash is for specifying generic char-
acter types:
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any character that is not a whitespace character
\w any "word" character
\W any "non-word" character
Each pair of escape sequences partitions the complete set
of characters into two disjoint sets. Any given character
matches one, and only one, of each pair.
A "word" character is any letter or digit or the under-
score character, that is, any character which can be part
of a Perl "word". The definition of letters and digits is
controlled by PCRE's character tables, and may vary if
locale-specific matching is taking place (see "Locale
support" in the full pcre man page). For example, in the "fr" (French) locale,
some character codes greater than 128 are used for
accented letters, and these are matched by \w.
These character type sequences can appear both inside and
outside character classes. They each match one character
of the appropriate type. If the current matching point is
at the end of the subject string, all of them fail, since
there is no character to match.
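The partitioning property and the end-of-subject behaviour can be checked with a short sketch. Python's re module is used as a stand-in for PCRE (an assumption), with re.ASCII so that the character types follow the plain ASCII definitions.

```python
import re

sample = "a_9 \t-"

# Every character matches exactly one of \w / \W and exactly one of \s / \S.
word_partition = all(
    bool(re.fullmatch(r"\w", ch, re.ASCII)) != bool(re.fullmatch(r"\W", ch, re.ASCII))
    for ch in sample
)
space_partition = all(
    bool(re.fullmatch(r"\s", ch, re.ASCII)) != bool(re.fullmatch(r"\S", ch, re.ASCII))
    for ch in sample
)

# At the end of the subject there is no character left, so every type fails.
at_end = [re.match(r"abc" + t, "abc") for t in (r"\w", r"\W", r"\s", r"\S")]

print(word_partition, space_partition, at_end)
```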
The fourth use of backslash is for certain simple asser-
tions. An assertion specifies a condition that has to be
met at a particular point in a match, without consuming
any characters from the subject string. The use of subpat-
terns for more complicated assertions is described below.
The backslashed assertions are
\b word boundary
\B not a word boundary
\A start of subject (independent of multiline mode)
\Z end of subject or newline at end (independent of
multiline mode)
\z end of subject (independent of multiline mode)
These assertions may not appear in character classes (but
note that "\b" has a different meaning, namely the
backspace character, inside a character class).
A word boundary is a position in the subject string where
the current character and the previous character do not
both match \w or \W (i.e. one matches \w and the other
matches \W), or the start or end of the string if the
first or last character matches \w, respectively.
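A minimal sketch of \b and \B, using Python's re module as a stand-in for PCRE (an assumption for illustration; both follow the word-boundary definition given above):

```python
import re

subject = "cat concat cat's"

# \bcat\b finds "cat" only where it is delimited by non-word characters
# or by the start or end of the subject.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \Bcat finds "cat" only where a word character precedes it.
inside_word = [m.start() for m in re.finditer(r"\Bcat", subject)]

print(whole_words, inside_word)
```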
The \A, \Z, and \z assertions differ from the traditional
circumflex and dollar (described below) in that they only
ever match at the very start and end of the subject
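string, whatever options are set. They are not affected by
the PCRE_NOTBOL or PCRE_NOTEOL options. The difference
between \Z and \z is that \Z matches before a newline that
is the last character of the string as well as at the end
of the string, whereas \z matches only at the end.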
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode,
the circumflex character is an assertion which is true
only if the current matching point is at the start of the
subject string. Inside a character class, circumflex has
an entirely different meaning (see below).
Circumflex need not be the first character of the pattern
if a number of alternatives are involved, but it should be
the first thing in each alternative in which it appears if
the pattern is ever to match that branch. If all possible
alternatives start with a circumflex, that is, if the pat-
tern is constrained to match only at the start of the sub-
ject, it is said to be an "anchored" pattern. (There are
also other constructs that can cause a pattern to be
anchored.)
A dollar character is an assertion which is true only if
the current matching point is at the end of the subject
string, or immediately before a newline character that is
the last character in the string (by default). Dollar need
not be the last character of the pattern if a number of
alternatives are involved, but it should be the last item
in any branch in which it appears. Dollar has no special
meaning in a character class.
The meaning of dollar can be changed so that it matches
only at the very end of the string, by setting the
PCRE_DOLLAR_ENDONLY option at compile or matching time.
This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are
changed if the PCRE_MULTILINE option is set. When this is
the case, they match immediately after and immediately
before an internal "\n" character, respectively, in addi-
tion to matching at the start and end of the subject
string. For example, the pattern /^abc$/ matches the sub-
ject string "def\nabc" in multiline mode, but not other-
wise. Consequently, patterns that are anchored in single
line mode because all branches start with "^" are not
anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option
is ignored if PCRE_MULTILINE is set.
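The effect of multiline mode can be seen in the sketch below. It uses Python's re module as a stand-in for PCRE, with re.MULTILINE playing the role of PCRE_MULTILINE (an assumption for illustration; Xnews itself matches single-line strings).

```python
import re

subject = "def\nabc"

# Default mode: ^ and $ only match at the very start and end of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ and $ also match next to an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Even in default mode, $ matches just before a final newline.
trailing = re.search(r"^abc$", "abc\n")

print(single, multi.start(), trailing.group())
```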
Note that the sequences \A, \Z, and \z can be used to
match the start and end of the subject in both modes, and
if all branches of a pattern start with \A it is always
anchored, whether PCRE_MULTILINE is set or not.
FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches
any one character in the subject, including a non-printing
character, but not (by default) newline. If the
PCRE_DOTALL option is set, then dots match newlines as
well. The handling of dot is entirely independent of the
handling of circumflex and dollar, the only relationship
being that they both involve newline characters. Dot has
no special meaning in a character class.
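A short sketch of the dot, using Python's re module as a stand-in for PCRE, with re.DOTALL playing the role of PCRE_DOTALL (an assumption for illustration):

```python
import re

# By default the dot does not match a newline ...
default_dot = re.search(r"a.c", "a\nc")

# ... but it does match other non-printing characters such as a tab.
tab_dot = re.search(r"a.c", "a\tc")

# With the dot-all option set, the dot matches a newline as well.
dotall = re.search(r"a.c", "a\nc", re.DOTALL)

print(default_dot, tab_dot is not None, dotall is not None)
```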
SQUARE BRACKETS
An opening square bracket introduces a character class,
terminated by a closing square bracket. A closing square
bracket on its own is not special. If a closing square
bracket is required as a member of the class, it should be
the first data character in the class (after an initial
circumflex, if present) or escaped with a backslash.
A character class matches a single character in the sub-
ject; the character must be in the set of characters
defined by the class, unless the first character in the
class is a circumflex, in which case the subject character
must not be in the set defined by the class. If a circum-
flex is actually required as a member of the class, ensure
it is not the first character, or escape it with a back-
slash.
For example, the character class [aeiou] matches any lower
case vowel, while [^aeiou] matches any character that is
not a lower case vowel. Note that a circumflex is just a
convenient notation for specifying the characters which
are in the class by enumerating those that are not. It is
not an assertion: it still consumes a character from the
subject string, and fails if the current pointer is at the
end of the string.
When caseless matching is set, any letters in a class rep-
resent both their upper case and lower case versions, so
for example, a caseless [aeiou] matches "A" as well as
"a", and a caseless [^aeiou] does not match "A", whereas a
caseful version would.
The newline character is never treated in any special way
in character classes, whatever the setting of the
PCRE_DOTALL or PCRE_MULTILINE options is. A class such as
[^a] will always match a newline.
The minus (hyphen) character can be used to specify a
range of characters in a character class. For example,
[d-m] matches any letter between d and m, inclusive. If a
minus character is required in a class, it must be escaped
with a backslash or appear in a position where it cannot
be interpreted as indicating a range, typically as the
first or last character in the class. It is not possible
to have the character "]" as the end character of a range,
since a sequence such as [w-] is interpreted as a class of
two characters. The octal or hexadecimal representation of
"]" can, however, be used to end a range.
Ranges operate in ASCII collating sequence. They can also
be used for characters specified numerically, for example
[\000-\037]. If a range that includes letters is used when
caseless matching is set, it matches the letters in either
case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
matched caselessly, and if character tables for the "fr"
locale are in use, [\xc8-\xcb] matches accented E charac-
ters in both cases.
The character types \d, \D, \s, \S, \w, and \W may also
appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches
any hexadecimal digit. A circumflex can conveniently be
used with the upper case character types to specify a more
restricted set of characters than the matching lower case
type. For example, the class [^\W_] matches any letter or
digit, but not underscore.
All non-alphameric characters other than \, -, ^ (at the
start) and the terminating ] are non-special in character
classes, but it does no harm if they are escaped.
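Several of the character class rules in this section can be exercised with the following sketch. It uses Python's re module as a stand-in for PCRE (an assumption for illustration), with re.IGNORECASE in place of PCRE_CASELESS.

```python
import re

# [aeiou] matches lower case vowels only, unless matching is caseless.
lower_vowels = re.findall(r"[aeiou]", "Education")
any_case = re.findall(r"[aeiou]", "Education", re.IGNORECASE)

# A negated class still consumes a character, and it matches a newline.
non_vowels = re.findall(r"[^aeiou]", "audio\n")

# Character types may appear inside a class.
hex_ok = re.fullmatch(r"[\dABCDEF]+", "1F4A") is not None

# [^\W_] is "any word character except underscore".
alnum = re.findall(r"[^\W_]", "a_b-1", re.ASCII)

print(lower_vowels, any_case, non_vowels, hex_ok, alnum)
```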
VERTICAL BAR
Vertical bar characters are used to separate alternative
patterns. For example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of
alternatives may appear, and an empty alternative is per-
mitted (matching the empty string). The matching process
tries each alternative in turn, from left to right, and
the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means
matching the rest of the main pattern as well as the
alternative in the subpattern.
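The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can be illustrated with a sketch in Python's re module, used here as a stand-in for PCRE (an assumption):

```python
import re

names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# The first alternative that succeeds is used, even if a later one
# would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too, so here the second alternative is chosen.
in_group = re.match(r"(a|ab)c", "abc").group(1)

print(names, first_wins, in_group)
```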
INTERNAL OPTION SETTING
The settings of PCRE_CASELESS, PCRE_MULTILINE,
PCRE_DOTALL, and PCRE_EXTENDED can be changed from within
the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are
i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED
For example, (?im) sets caseless, multiline matching. It
is also possible to unset these options by preceding the
letter with a hyphen, and a combined setting and unsetting
such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTI-
LINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is
also permitted. If a letter appears both before and after
the hyphen, the option is unset.
The scope of these option changes depends on where in the
pattern the setting occurs. For settings that are outside
any subpattern (defined below), the effect is the same as
if the options were set or unset at the start of matching.
The following patterns all behave in exactly the same way:
(?i)abc
a(?i)bc
ab(?i)c
abc(?i)
which in turn is the same as compiling the pattern abc
with PCRE_CASELESS set. In other words, such "top level"
settings apply to the whole pattern (unless there are
other changes inside subpatterns). If there is more than
one setting of the same option at top level, the rightmost
setting is used.
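The equivalence between a top-level (?i) and the caseless option can be sketched with Python's re module as a stand-in for PCRE (an assumption). Note one difference: recent versions of Python only accept such a global setting at the very start of the pattern, so the forms a(?i)bc, ab(?i)c and abc(?i) are not reproduced here.

```python
import re

# A leading (?i) has the same effect as compiling with the caseless option.
inline = re.fullmatch(r"(?i)abc", "ABC") is not None
flagged = re.fullmatch(r"abc", "ABC", re.IGNORECASE) is not None

# Without either, matching is caseful.
plain = re.fullmatch(r"abc", "ABC") is not None

print(inline, flagged, plain)
```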
If an option change occurs inside a subpattern, the effect
is different. This is a change of behaviour in Perl 5.005.
An option change inside a subpattern affects only that
part of the subpattern that follows it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming
PCRE_CASELESS is not used). By this means, options can be
made to have different settings in different parts of the
pattern. Any changes made in one alternative do carry on
into subsequent branches within the same subpattern. For
example,
(a(?i)b|c)
matches "ab", "aB", "c", and "C", even though when match-
ing "C" the first branch is abandoned before the option
setting. This is because the effects of option settings
happen at compile time. There would be some very weird
behaviour otherwise.
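The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
be changed in the same way as the Perl-compatible options
by using the characters U and X respectively. The (?X)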
flag setting is special in that it must always occur ear-
lier in the pattern than any of the additional features it
turns on, even when it is at top level. It is best put at
the start.
SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets),
which can be nested. Marking part of a pattern as a sub-
pattern does two things:
1. It localizes a set of alternatives. For example, the
pattern
cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpil-
lar". Without the parentheses, it would match "cataract",
"erpillar" or the empty string.
2. It sets up the subpattern as a capturing subpattern (as
defined above). When the whole pattern matches, that por-
tion of the subject string that matched the subpattern is
passed back to the caller via the ovector argument of
pcre_exec(). Opening parentheses are counted from left to
right (starting from 1) to obtain the numbers of the cap-
turing subpatterns.
For example, if the string "the red king" is matched
against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king",
and are numbered 1, 2, and 3.
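Both functions of plain parentheses can be seen in the sketch below, which uses Python's re module as a stand-in for PCRE (an assumption; in PCRE itself the captured substrings come back through pcre_exec()).

```python
import re

# 1. Parentheses localize the alternatives.
candidates = ("cat", "cataract", "caterpillar", "cater")
words = [w for w in candidates if re.fullmatch(r"cat(aract|erpillar|)", w)]

# 2. Parentheses capture, numbered by their opening parenthesis.
match = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
groups = match.groups()

print(words, groups)
```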
The fact that plain parentheses fulfil two functions is
not always helpful. There are often times when a grouping
subpattern is required without a capturing requirement. If
an opening parenthesis is followed by "?:", the subpattern
does not do any capturing, and is not counted when comput-
ing the number of any subsequent capturing subpatterns.
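A small sketch of the "?:" form, again using Python's re module as a stand-in for PCRE (an assumption). The non-capturing group is skipped when the capturing subpatterns are numbered.

```python
import re

match = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# Only two groups exist: the outer one and (king|queen).
groups = match.groups()

print(groups)
```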
For example, if the string "the white queen" is matched
against the pattern
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and
are numbered 1 and 2. The maximum number of captured
substrings is 99, and the maximum number of all subpatterns,
both capturing and non-capturing, is 200.
As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the "?" and the ":".
Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an
option setting in one branch does affect subsequent
branches, so the above patterns match "SUNDAY" as well as
"Saturday".
REPETITION
Repetition is specified by quantifiers, which can follow
any of the following items:
a single character, possibly escaped
the . metacharacter
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion -
see below)
The general repetition quantifier specifies a minimum and
maximum number of permitted matches, by giving the two
numbers in curly brackets (braces), separated by a comma.
The numbers must be less than 65536, and the first must be
less than or equal to the second. For example:
z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own
is not a special character. If the second number is omit-
ted, but the comma is present, there is no upper limit; if
the second number and the comma are both omitted, the
quantifier specifies an exact number of required matches.
Thus
[aeiou]{3,}
matches at least 3 successive vowels, but may match many
more, while
\d{8}
matches exactly 8 digits. An opening curly bracket that
appears in a position where a quantifier is not allowed,
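or one that does not match the syntax of a quantifier, is
taken as a literal character. For example, {,6} is not a
quantifier, but a literal string of four characters.
The quantifier {0} is permitted, causing the expression to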
behave as if the previous item and the quantifier were not
present.
For convenience (and historical compatibility) the three
most common quantifiers have single-character abbrevia-
tions:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
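The quantifier forms above can be checked with a sketch in Python's re module, used here as a stand-in for PCRE (an assumption for illustration):

```python
import re

# z{2,4} accepts two to four repetitions.
runs = ("z", "zz", "zzz", "zzzz", "zzzzz")
z_runs = [s for s in runs if re.fullmatch(r"z{2,4}", s)]

# {3,} has no upper limit; \d{8} is an exact count.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()
eight = re.fullmatch(r"\d{8}", "20240101") is not None
seven = re.fullmatch(r"\d{8}", "2024010") is not None

# {0} behaves as if the item and its quantifier were absent.
zero = re.fullmatch(r"ab{0}c", "ac") is not None

# The single-character abbreviations agree with their brace forms.
pairs = ((r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}"))
shorthand = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(full, s))
    for short, full in pairs
    for s in ("", "a", "aa")
)

print(z_runs, vowels, eight, seven, zero, shorthand)
```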
It is possible to construct infinite loops by following a
subpattern that can match no characters with a quantifier
that has no upper limit, for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at
compile time for such patterns. However, because there are
cases where this can be useful, such patterns are now
accepted, but if any repetition of the subpattern does in
fact match no characters, the loop is forcibly broken.
By default, the quantifiers are "greedy", that is, they
match as much as possible (up to the maximum number of
permitted times), without causing the rest of the pattern
to fail. The classic example of where this gives problems
is in trying to match comments in C programs. These appear
between the sequences /* and */ and within the sequence,
individual * and / characters may appear. An attempt to
match C comments by applying the pattern
/\*.*\*/
to the string
/* first comment */ not comment /* second comment */
fails, because it matches the entire string due to the
greediness of the .* item.
However, if a quantifier is followed by a question mark,
then it ceases to be greedy, and instead matches the mini-
mum number of times possible, so the pattern
/\*.*?\*/
does the right thing with the C comments. The meaning of
the various quantifiers is not otherwise changed, just the
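preferred number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own
right. Because it has two uses, it can sometimes appear
doubled, as in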
\d??\d
which matches one digit by preference, but can match two
if that is the only way the rest of the pattern matches.
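The difference between greedy and minimizing quantifiers can be seen in the sketch below, which uses Python's re module as a stand-in for PCRE (an assumption for illustration):

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy .* runs on to the last "*/" in the subject.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing .*? stops at the first "*/" it can.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the match requires it.
prefer_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, prefer_one, forced_two)
```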
If the PCRE_UNGREEDY option is set (an option which is not
available in Perl) then the quantifiers are not greedy by
default, but individual ones can be made greedy by follow-
ing them with a question mark. In other words, it inverts
the default behaviour.
When a parenthesized subpattern is quantified with a mini-
mum repeat count that is greater than 1 or with a limited
maximum, more store is required for the compiled pattern,
in proportion to the size of the minimum or maximum.
If a pattern starts with .* then it is implicitly
anchored, since whatever follows will be tried against
every character position in the subject string. PCRE
treats this as though it were preceded by \A.
When a capturing subpattern is repeated, the value cap-
tured is the substring that matched the final iteration.
For example, after
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the cap-
tured substring is "tweedledee". However, if there are
nested capturing subpatterns, the corresponding captured
values may have been set in previous iterations. For exam-
ple, after
/(a|(b))+/
matches "aba" the value of the second captured substring
is "b".
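Both examples can be reproduced with Python's re module, used here as a stand-in for PCRE (an assumption; the two engines agree on these cases):

```python
import re

# A repeated capturing group keeps the text of its final iteration.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last = repeated.group(1)

# A nested group keeps the value set in an earlier iteration when the
# final iteration does not go through it.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last, nested)
```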
BACK REFERENCES
Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e. to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.
However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
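causes an error only if there are not that many capturing
left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the
left of the reference for numbers less than 10. See the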
section entitled "Backslash" above for further details of
the handling of digits following a backslash.
A back reference matches whatever actually matched the
capturing subpattern in the current subject string, rather
than anything matching the subpattern itself. So the pat-
tern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and respon-
sibility", but not "sense and responsibility". If caseful
matching is in force at the time of the back reference,
then the case of letters is relevant. For example,
((?i)rah)\s+\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even
though the original capturing subpattern is matched case-
lessly.
There may be more than one back reference to the same sub-
pattern. If a subpattern has not actually been used in a
particular match, then any back references to it always
fail. For example, the pattern
(a|(bc))\2
always fails if it starts to match "a" rather than "bc".
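A sketch of back references, using Python's re module as a stand-in for PCRE (an assumption for illustration):

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")
subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
# The back reference must repeat the text that group 1 actually matched.
results = [pattern.fullmatch(s) is not None for s in subjects]

# A back reference to a group that took no part in the match fails.
unused = re.fullmatch(r"(a|(bc))\2", "aa")
used = re.fullmatch(r"(a|(bc))\2", "bcbc")

print(results, unused, used is not None)
```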
Because there may be up to 99 back references, all digits
following the backslash are taken as part of a potential
back reference number. If the pattern continues with a
digit character, then some delimiter must be used to ter-
minate the back reference. If the PCRE_EXTENDED option is
set, this can be whitespace. Otherwise an empty comment
can be used.
A back reference that occurs inside the parentheses to
which it refers fails when the subpattern is first used,
so, for example, (a\1) never matches. However, such ref-
erences can be useful inside repeated subpatterns. For
example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababaa" etc.
At each iteration of the subpattern, the back reference
matches the character string corresponding to the previous
iteration. In order for this to work, the pattern must be
such that the first iteration does not need to match the
back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of
zero.
ASSERTIONS
An assertion is a test on the characters following or pre-
ceding the current matching point that does not actually
consume any characters. The simple assertions coded as \b,
\B, \A, \Z, \z, ^ and $ are described above. More compli-
cated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in
the subject string, and those that look behind it.
An assertion subpattern is matched in the normal way,
except that it does not cause the current matching posi-
tion to be changed. Lookahead assertions start with (?=
for positive assertions and (?! for negative assertions.
For example,
\w+(?=;)
matches a word followed by a semicolon, but does not
include the semicolon in the match, and
foo(?!bar)
matches any occurrence of "foo" that is not followed by
"bar". Note that the apparently similar pattern
(?!foo)bar
does not find an occurrence of "bar" that is preceded by
something other than "foo"; it finds any occurrence of
"bar" whatsoever, because the assertion (?!foo) is always
true when the next three characters are "bar". A lookbe-
hind assertion is needed to achieve this effect.
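The three lookahead patterns above can be tried with Python's re module, used here as a stand-in for PCRE (an assumption for illustration):

```python
import re

# The semicolon is required but is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# "foo" not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": the assertion looks at the
# next three characters, which are "bar", so it is always true there.
misleading = re.search(r"(?!foo)bar", "foobar")

print(words, foo_starts, misleading.start())
```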
Lookbehind assertions start with (?<= for positive asser-
tions and (?<! for negative assertions. For example,
(?<!foo)bar
does find an occurrence of "bar" that is not preceded by
"foo". The contents of a lookbehind assertion are
restricted such that all the strings it matches must have
a fixed length. However, if there are several alterna-
tives, they do not all have to have the same fixed length.
Thus
(?<=bullock|donkey)
is permitted, but
(?<!dogs?|cats?)
causes an error at compile time. Branches that match
different length strings are permitted only at the top level
of a lookbehind assertion. This is an extension compared
with Perl 5.005, which requires all branches to match the
same length of string. An assertion such as
(?<=ab(c|de))
is not permitted, because its single branch can match two
different lengths, but it is acceptable if rewritten to
use two branches:
(?<=abc|abde)
The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back
by the fixed width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail.
Assertions can be nested in any combination. For example,
(?<=(?<!foo)bar)baz
matches an occurrence of "baz" that is preceded by "bar"
which in turn is not preceded by "foo".
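A sketch of lookbehind, using Python's re module as a stand-in for PCRE (an assumption). Python is stricter than PCRE here: it requires every alternative in a lookbehind to have the same length, so (?<=bullock|donkey) is not reproduced.

```python
import re

# "bar" that is not preceded by "foo".
bar_starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

# Nested assertions: "baz" preceded by a "bar" that is itself
# not preceded by "foo".
rejected = re.search(r"(?<=(?<!foo)bar)baz", "foobarbaz")
accepted = re.search(r"(?<=(?<!foo)bar)baz", "quxbarbaz")

print(bar_starts, rejected, accepted.start())
```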
Assertion subpatterns are not capturing subpatterns, and
may not be repeated, because it makes no sense to assert
the same thing several times. If an assertion contains
capturing subpatterns within it, these are always counted
for the purposes of numbering the capturing subpatterns in
the whole pattern. Substring capturing is carried out for
positive assertions, but it does not make sense for nega-
tive assertions.
Assertions count towards the maximum of 200 parenthesized
subpatterns.
ONCE-ONLY SUBPATTERNS
With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be re-
evaluated to see if a different number of repeats allows
the rest of the pattern to match. Sometimes it is useful
to prevent this, either to change the nature of the match,
or to cause it to fail earlier than it otherwise might, when
the author of the pattern knows there is no point in car-
rying on.
Consider, for example, the pattern \d+foo when applied to
the subject line
123456bar
After matching all 6 digits and then failing to match
"foo", the normal action of the matcher is to try again
with only 5 digits matching the \d+ item, and then with 4,
and so on, before ultimately failing. Once-only subpat-
terns provide the means for specifying that once a portion
of the pattern has matched, it is not to be re-evaluated
in this way, so the matcher would give up immediately on
failing to match "foo" the first time. The notation is
another kind of special parenthesis, starting with (?> as
in this example:
(?>\d+)bar
This kind of parenthesis "locks up" the part of the pat-
tern it contains once it has matched, and a failure fur-
ther into the pattern is prevented from backtracking into
it. Backtracking past it to previous items, however, works
as normal.
An alternative description is that a subpattern of this
type matches the string of characters that an identical
standalone pattern would match, if anchored at the current
point in the subject string.
Once-only subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a
maximizing repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number
of digits they match in order to make the rest of the pat-
tern match, (?>\d+) can only match an entire sequence of
digits.
This construction can of course contain arbitrarily com-
plicated subpatterns, and it can be nested.
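The "locking" behaviour can be approximated in engines that lack the (?> notation. The sketch below uses Python's re module as a stand-in for PCRE and deliberately swaps in a different technique: a lookahead that captures, followed by a back reference. This is an emulation chosen for portability, not the (?>...) syntax itself; it relies on the engine never backtracking into a completed lookahead.

```python
import re

subject = "123456"

# Ordinary \d+ gives back one digit so that the final \d can match.
normal = re.search(r"\d+\d", subject)

# Emulated once-only group: the lookahead grabs all the digits, the back
# reference consumes them, and nothing can be given back afterwards.
locked = re.search(r"(?=(\d+))\1\d", subject)

# The locked form still matches when no backtracking is needed.
still_ok = re.search(r"(?=(\d+))\1bar", "123456bar")

print(normal.group(), locked, still_ok.group())
```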
CONDITIONAL SUBPATTERNS
It is possible to cause the matching process to obey a
subpattern conditionally or to choose between two alterna-
tive subpatterns, depending on the result of an assertion,
or whether a previous capturing subpattern matched or not.
The two possible forms of conditional subpattern are
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used;
otherwise the no-pattern (if present) is used. If there
are more than two alternatives in the subpattern, a com-
pile-time error occurs.
There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, then the
condition is satisfied if the capturing subpattern of that
number has previously matched. Consider the following pat-
tern, which contains non-significant white space to make
it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
The first part matches an optional opening parenthesis,
and if that character is present, sets it as the first
captured substring. The second part matches one or more
characters that are not parentheses. The third part is a
conditional subpattern that tests whether the first set of
parentheses matched or not. If they did, that is, if the
subject started with an opening parenthesis, the condition is
true, and so the yes-pattern is executed and a closing
parenthesis is required. Otherwise, since no-pattern is
not present, the subpattern matches nothing. In other
words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
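The parenthesis example can be run with Python's re module, used here as a stand-in for PCRE (an assumption), with re.VERBOSE in the role of PCRE_EXTENDED. Python supports the numbered-group form of condition only, not assertion conditions.

```python
import re

pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ("abc", "(abc)", "(abc", "abc)")
# A closing parenthesis is required exactly when an opening one matched.
results = [pattern.fullmatch(s) is not None for s in subjects]

print(results)
```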
If the condition is not a sequence of digits, it must be
an assertion. This may be a positive or negative lookahead
or lookbehind assertion. Consider this pattern, again con-
taining non-significant white space, and with the two
alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
The condition is a positive lookahead assertion that
matches an optional sequence of non-letters followed by a
letter. In other words, it tests for the presence of at
least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; other-
wise it is matched against the second. This pattern
matches strings in one of the two forms dd-aaa-dd or
dd-dd-dd, where aaa are letters and dd are digits.
COMMENTS
The sequence (?# marks the start of a comment which con-
tinues up to the next closing parenthesis. Nested paren-
theses are not permitted. The characters that make up a
comment play no part in the pattern matching at all.
If the PCRE_EXTENDED option is set, an unescaped # charac-
ter outside a character class introduces a comment that
continues up to the next newline character in the pattern.
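Both comment styles can be sketched with Python's re module as a stand-in for PCRE (an assumption), with re.VERBOSE in the role of PCRE_EXTENDED. The last line also shows an empty comment used to separate a back reference from a following digit.

```python
import re

# (?# ... ) comments take no part in the match.
inline = re.fullmatch(r"ab(?# this text is ignored)c", "abc") is not None

# In extended mode, "#" starts a comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
    """,
    re.VERBOSE,
)
phone = extended.fullmatch("555-1234") is not None

# An empty comment ends the back reference \1 before the literal "1".
separated = re.fullmatch(r"(a)\1(?#)1", "aa1") is not None

print(inline, phone, separated)
```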
PERFORMANCE
Certain items that may appear in patterns are more
efficient than others. It is more efficient to use a character
class like [aeiou] than a set of alternatives such as
(a|e|i|o|u). In general, the simplest construction that
provides the required behaviour is usually the most effi-
cient. Jeffrey Friedl's book contains a lot of discussion
about optimizing regular expressions for efficient perfor-
mance.
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
Copyright (c) 1998 University of Cambridge.